Accelerator model state dict #7474

Merged



@shuyingsunshine21 shuyingsunshine21 commented May 10, 2021

What does this PR do?

Fixes #7470

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, check the following list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Shuying Sun and others added 30 commits March 23, 2021 12:06
…oint_consolidate

Update test_all_gather_grad.py
…1-checkpoint_consolidate"

This reverts commit c5053da, reversing
changes made to 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.

codecov bot commented May 10, 2021

Codecov Report

Merging #7474 (adec445) into master (f6fe715) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #7474    +/-   ##
=======================================
- Coverage      92%     88%    -4%     
=======================================
  Files         199     199            
  Lines       13030   13057    +27     
=======================================
- Hits        11989   11431   -558     
- Misses       1041    1626   +585     


@SeanNaren SeanNaren left a comment


@awaelchli awaelchli added distributed Generic distributed-related topic feature Is an improvement or enhancement labels May 10, 2021
@awaelchli awaelchli added this to the v1.4 milestone May 10, 2021

@ananthsub ananthsub left a comment


do you see a good way to unit test the state dict being exposed from the accelerator?

@@ -420,6 +420,12 @@ def optimizer_state(self, optimizer: Optimizer) -> Dict[str, Tensor]:
        """
        return getattr(self.training_type_plugin, 'optimizer_state', lambda x: x.state_dict())(optimizer)

    def state_dict(self) -> Dict[str, Union[Any, Tensor]]:

why the union with tensor? same for L429

Suggested change
- def state_dict(self) -> Dict[str, Union[Any, Tensor]]:
+ def state_dict(self) -> Dict[str, Any]:

Contributor Author

Oh, I just followed the same type defined below for on_save, but I think you are right: Dict[str, Any] should be enough.

Member

@ananthsub I added the Union here originally, I think. This was done because the value is often a tensor but doesn't have to be. And if you omit Tensor here, IDEs don't show autocomplete suggestions for Tensors.
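
For readers unfamiliar with the pattern in the hunk above: the accelerator looks up a hook on its training type plugin and falls back to a default when the plugin does not define it. The following is only a minimal sketch of that lookup. FakeOptimizer, PlainPlugin, ShardedPlugin, and SketchAccelerator are hypothetical stand-ins, not Lightning classes, and plain dicts stand in for tensors.

```python
from typing import Any, Dict


class FakeOptimizer:
    """Hypothetical stand-in for torch.optim.Optimizer."""

    def state_dict(self) -> Dict[str, Any]:
        return {"state": {}, "param_groups": [{"lr": 0.1}]}


class PlainPlugin:
    """Hypothetical plugin that does not define optimizer_state."""


class ShardedPlugin:
    """Hypothetical plugin that customizes how optimizer state is collected."""

    def optimizer_state(self, optimizer: FakeOptimizer) -> Dict[str, Any]:
        state = optimizer.state_dict()
        state["consolidated"] = True  # marker added only for this illustration
        return state


class SketchAccelerator:
    def __init__(self, training_type_plugin: Any) -> None:
        self.training_type_plugin = training_type_plugin

    def optimizer_state(self, optimizer: FakeOptimizer) -> Dict[str, Any]:
        # Same shape as the line in the hunk: use the plugin's hook if it
        # exists, otherwise fall back to the optimizer's own state_dict().
        hook = getattr(
            self.training_type_plugin,
            "optimizer_state",
            lambda x: x.state_dict(),
        )
        return hook(optimizer)


plain = SketchAccelerator(PlainPlugin()).optimizer_state(FakeOptimizer())
sharded = SketchAccelerator(ShardedPlugin()).optimizer_state(FakeOptimizer())
```

Note that the returned values mix dicts, lists, floats, and a bool, which is the kind of case where Dict[str, Any] describes the contents more accurately than a tensor-only value type.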

@shuyingsunshine21
Contributor Author

do you see a good way to unit test the state dict being exposed from the accelerator?

Will check whether we already have an existing test for the state dict; if not, I will come up with something.
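
One possible shape for such a test, sketched with hypothetical stand-ins rather than real Lightning classes. ToyModule, ToyPlugin, ToyAccelerator, and the plugin-level state_dict method are assumptions made for illustration (lists replace tensors), so the test actually written for this PR may look different.

```python
from typing import Any, Dict


class ToyModule:
    """Hypothetical stand-in for a LightningModule; lists replace tensors."""

    def __init__(self) -> None:
        self.weight = [1.0, 2.0]
        self.bias = [0.5]

    def state_dict(self) -> Dict[str, Any]:
        return {"weight": list(self.weight), "bias": list(self.bias)}


class ToyPlugin:
    """Hypothetical training type plugin that owns the module."""

    def __init__(self, module: ToyModule) -> None:
        self.module = module

    def state_dict(self) -> Dict[str, Any]:
        return self.module.state_dict()


class ToyAccelerator:
    """Hypothetical accelerator that forwards to its plugin."""

    def __init__(self, plugin: ToyPlugin) -> None:
        self.training_type_plugin = plugin

    def state_dict(self) -> Dict[str, Any]:
        return self.training_type_plugin.state_dict()


def test_accelerator_state_dict_matches_module() -> None:
    module = ToyModule()
    accelerator = ToyAccelerator(ToyPlugin(module))

    exposed = accelerator.state_dict()
    expected = module.state_dict()

    # The accelerator should expose exactly the module's keys and values.
    assert exposed.keys() == expected.keys()
    for key in expected:
        assert exposed[key] == expected[key]


test_accelerator_state_dict_matches_module()
```

With real tensors, the per-key comparison would need a tensor-aware equality check such as torch.equal instead of ==.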


@SeanNaren SeanNaren merged commit 8538c1f into Lightning-AI:master May 11, 2021
@shuyingsunshine21 shuyingsunshine21 deleted the accelerator_model_state_dict branch May 11, 2021 18:22
Development

Successfully merging this pull request may close these issues.

API change, expose model's state_dict to accelerator.training_type_plugin
5 participants